
31  Bayesian Classifier

The Bayes classifier is the hypothesis h^{*} that minimizes the risk

h^{*} = \argmin_{h} R (h)

and the risk achieved by the Bayes classifier is called the Bayes risk,

R^{*} = R (h^{*})

which is the minimum risk that any hypothesis can achieve when the true distribution \mathbb{P}_{X, Y} is known.
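
As a small worked example (the joint distribution here is made up purely for illustration), let X \in \{a, b\} and Y \in \{0, 1\} with

\mathbb{P}_{X, Y} (a, 0) = 0.4, \quad \mathbb{P}_{X, Y} (a, 1) = 0.1, \quad \mathbb{P}_{X, Y} (b, 0) = 0.2, \quad \mathbb{P}_{X, Y} (b, 1) = 0.3.

Under the 0-1 loss (for which, as shown in the next section, the Bayes classifier predicts the most probable label at each point), h^{*} (a) = 0 since \mathbb{P}_{Y \mid X} (0 \mid a) = 0.4 / 0.5 = 0.8, and h^{*} (b) = 1 since \mathbb{P}_{Y \mid X} (1 \mid b) = 0.3 / 0.5 = 0.6, which gives the Bayes risk

R^{*} = \mathbb{P}_{X} (a) \times 0.2 + \mathbb{P}_{X} (b) \times 0.4 = 0.5 \times 0.2 + 0.5 \times 0.4 = 0.3.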

MAP rule

Since Lemma 30.1 shows that R (h) = \mathbb{E}_{X} \left[ \mathbb{E}_{Y \mid X} \left[ L (h (X), Y) \right] \right], the risk is minimized by minimizing the inner conditional expectation at every x, so the Bayes classifier can also be written as the hypothesis that minimizes the conditional expectation

h^{*} = \argmin_{h} \mathbb{E}_{Y \mid X} \left[ L (h (X), Y) \right],

which can be further simplified to the maximum a posteriori probability (MAP) rule when the loss function is the 0-1 loss and there are m labels y \in [1, m]

h^{*} (x) = \argmax_{y \in [1, m]} \mathbb{P}_{Y \mid X} (y \mid x)

Proof

According to the definition of Bayes classifier,

\begin{aligned} h^{*} & = \argmin_{h} \mathbb{E}_{Y \mid X} \left[ L (h (X), Y) \right] \\ & = \argmin_{h} \sum_{y=1}^{m} \mathbb{P}_{Y \mid X} (y \mid x) L (h (x), y) & [\text{def of } \mathbb{E}_{Y \mid X}] \\ & = \argmin_{h} \left[ \mathbb{P}_{Y \mid X} (h (x) \mid x) \times 0 + \sum_{y \neq h (x)} \mathbb{P}_{Y \mid X} (y \mid x) \times 1 \right] & [\text{def of 0-1 loss}] \\ & = \argmin_{h} \sum_{y \neq h (x)} \mathbb{P}_{Y \mid X} (y \mid x) \\ & = \argmin_{h} \left[ 1 - \mathbb{P}_{Y \mid X} (h (x) \mid x) \right] & [\sum_{y \neq \alpha} \mathbb{P}_{Y \mid X} (y \mid x) = 1 - \mathbb{P}_{Y \mid X} (\alpha \mid x)] \\ & = \argmax_{h} \mathbb{P}_{Y \mid X} (h (x) \mid x) & [\argmin_{x} (1 - f (x)) = \argmax_{x} f (x)]. \end{aligned}

where the last line can be simplified to

h^{*} (x) = \argmax_{y \in [1, m]} \mathbb{P}_{Y \mid X} (y \mid x)

since h (x) \in [1, m].

According to Bayes' theorem,

\begin{aligned} \argmax_{y} \mathbb{P}_{Y \mid X} (y \mid x) & = \argmax_{y} \frac{\mathbb{P}_{X \mid Y} (x \mid y) \mathbb{P}_{Y} (y)}{\mathbb{P}_{X} (x)} \\ & = \argmax_{y} \mathbb{P}_{X \mid Y} (x \mid y) \mathbb{P}_{Y} (y) & [\mathbb{P}_{X} (x) \text{ does not depend on } y]. \end{aligned}

The MAP rule can thus be computed from the class-conditional probability (the likelihood) and the class probability (the prior), which is more practical because these two quantities are usually easier to estimate from data than the posterior probability.

Using the log trick, the Bayes decision rule (BDR) for the 0-1 loss is often computed as

\begin{aligned} \argmax_{y} \ln \mathbb{P}_{Y \mid X} (y \mid x) & = \argmax_{y} \ln \left[ \mathbb{P}_{X \mid Y} (x \mid y) \mathbb{P}_{Y} (y) \right] & [\ln \text{ is monotonically increasing}] \\ & = \argmax_{y} \left[ \ln \mathbb{P}_{X \mid Y} (x \mid y) + \ln \mathbb{P}_{Y} (y) \right]. \end{aligned}
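
As a sketch of how the log form is used in practice, the snippet below implements the MAP rule for a toy problem where the class-conditional densities are assumed to be univariate Gaussians with known parameters; the model, parameter values, and function names are illustrative assumptions, not part of this note's setup.

```python
import math

# Illustrative model: P(x | y) is assumed to be a univariate Gaussian N(mu_y, sigma_y^2)
# with a known prior P(y); all parameter values below are made up for this sketch.
classes = {
    0: {"mu": -1.0, "sigma": 1.0, "prior": 0.6},
    1: {"mu": 2.0, "sigma": 1.5, "prior": 0.4},
}

def log_gaussian_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def map_predict(x):
    """MAP rule via the log trick: argmax_y [ ln P(x | y) + ln P(y) ]."""
    return max(
        classes,
        key=lambda y: log_gaussian_pdf(x, classes[y]["mu"], classes[y]["sigma"])
        + math.log(classes[y]["prior"]),
    )

print(map_predict(-0.5), map_predict(1.8))  # 0 1: the decision flips as x moves toward class 1
```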

MAP rule for binary classification

Since there are only 2 labels in the binary classification problem, the MAP rule for binary classification simplifies to

\begin{aligned} h^{*} (x) & = \argmax_{y \in \{0, 1\}} \mathbb{P}_{Y \mid X} (y \mid x) \\ & = \mathbb{1} \left[ \mathbb{P}_{Y \mid X} (1 \mid x) > \mathbb{P}_{Y \mid X} (0 \mid x) \right] \\ & = \begin{cases} 1, \quad \mathbb{P}_{Y \mid X} (1 \mid x) > \frac{ 1 }{ 2 } \\ 0, \quad \mathbb{P}_{Y \mid X} (1 \mid x) \leq \frac{ 1 }{ 2 }. \end{cases} \end{aligned}
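
A minimal sketch of this equivalence in code (the helper names are made up for illustration): both forms of the binary MAP rule below give the same prediction.

```python
import random

def map_argmax(p1):
    """Binary MAP rule written as an argmax over the two posteriors (p0 = 1 - p1)."""
    return max((0, 1), key=lambda y: p1 if y == 1 else 1 - p1)

def map_threshold(p1):
    """Equivalent thresholded form: predict 1 iff P(Y = 1 | x) > 1/2."""
    return 1 if p1 > 0.5 else 0

# The two forms agree for any posterior value (a tie at exactly 1/2 is broken the same way here).
assert all(map_argmax(p) == map_threshold(p) for p in (random.random() for _ in range(10_000)))
```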

Regression function

The conditional distribution of Y given X = x can be modeled as a Bernoulli distribution, Y \mid X = x \sim \mathrm{Ber} (\eta (x)), where \eta (x) is the regression function

\eta (x) = \mathbb{P}_{Y \mid X} (1 \mid x) = \mathbb{E}_{Y \mid X} \left[ Y \mid X = x \right].

Lemma 31.1 For any hypothesis h, its risk under the 0-1 loss for binary classification can be written as

R (h) = \mathbb{E}_{X} \left[ \eta (X) (1 - h (X)) + (1 - \eta (X)) h (X) \right].

Proof

According to the definition of the risk function,

R (h) = \mathbb{E}_{X} \left[ \mathbb{E}_{Y \mid X} \left[ L (h (X), Y) \right] \right].

Since the 0-1 loss for the binary classification problem can be written as

L (h (x), y) = \mathbb{1} \left[ h (x) \neq y \right] = y (1 - h (x)) + (1 - y) h (x)

we have

\begin{aligned} R (h) & = \mathbb{E}_{X} \left[ \mathbb{E}_{Y \mid X} \left[ L (h (X), Y) \right] \right] \\ & = \mathbb{E}_{X} \left[ \mathbb{E}_{Y \mid X} \left[ Y (1 - h (X)) + (1 - Y) h (X) \right] \right] \\ & = \mathbb{E}_{X} \left[ \mathbb{E}_{Y \mid X} [Y] (1 - h (X)) + \mathbb{E}_{Y \mid X} [1 - Y] h (X) \right] \\ & = \mathbb{E}_{X} \left[ \eta (X) (1 - h (X)) + (1 - \eta (X)) h (X) \right]. \end{aligned}
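
A quick Monte Carlo sanity check of Lemma 31.1, under an assumed toy model (X uniform on [0, 1], \eta (x) = x, and an arbitrary threshold hypothesis; all of these choices are illustrative only):

```python
import random

random.seed(0)

def eta(x):
    """Assumed regression function eta(x) = P(Y = 1 | X = x); illustrative only."""
    return x

def h(x):
    """An arbitrary (non-Bayes) hypothesis that thresholds at 0.7."""
    return 1 if x > 0.7 else 0

n = 200_000
xs = [random.random() for _ in range(n)]                 # X ~ Uniform(0, 1)
ys = [1 if random.random() < eta(x) else 0 for x in xs]  # Y | X = x ~ Ber(eta(x))

# Empirical 0-1 risk of h, estimated from the sampled labels.
empirical_risk = sum(h(x) != y for x, y in zip(xs, ys)) / n

# Risk computed via Lemma 31.1: R(h) = E_X[ eta(X)(1 - h(X)) + (1 - eta(X)) h(X) ].
lemma_risk = sum(eta(x) * (1 - h(x)) + (1 - eta(x)) * h(x) for x in xs) / n

print(empirical_risk, lemma_risk)  # both should be close to the true risk (0.29 for this model)
```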

Theorem 31.1 The risk of the Bayes classifier for binary classification with 0-1 loss is the expectation of the minimum of \eta (X) and 1 - \eta (X)

R (h^{*}) = \mathbb{E}_{X} \left[ \min \left[ \eta (X), 1 - \eta (X) \right] \right]

and is at most \frac{ 1 }{ 2 }

\mathbb{E}_{X} \left[ \min \left[ \eta (X), 1 - \eta (X) \right] \right] \leq \frac{ 1 }{ 2 }.

Proof

By applying Lemma 31.1 and replacing h with the Bayes classifier h^{*}, we have

\begin{aligned} R (h^{*}) & = \mathbb{E}_{X} \left[ \eta (X) (1 - h^{*} (X)) + (1 - \eta (X)) h^{*} (X) \right] \\ & = \mathbb{E}_{X} \left[ \eta (X) \mathbb{1} \left[ \eta (X) \leq \frac{ 1 }{ 2 } \right] + (1 - \eta (X)) \mathbb{1} \left[ \eta (X) > \frac{ 1 }{ 2 } \right] \right] \\ & = \mathbb{E}_{X} \left[ \min \left[ \eta (X), 1 - \eta (X) \right] \right] \end{aligned}

where the last equality follows because, for every x,

\eta (x) \mathbb{1} \left[ \eta (x) \leq \frac{ 1 }{ 2 } \right] + (1 - \eta (x)) \mathbb{1} \left[ \eta (x) > \frac{ 1 }{ 2 } \right] = \begin{cases} \eta (x) = \min \left[ \eta (x), 1 - \eta (x) \right], \quad & \text{if } \eta (x) \leq \frac{ 1 }{ 2 } \\ 1 - \eta (x) = \min \left[ \eta (x), 1 - \eta (x) \right], \quad & \text{if } \eta (x) > \frac{ 1 }{ 2 }. \end{cases}

Since \min [\eta (X), 1 - \eta (X)] \leq \frac{ 1 }{ 2 } always holds, its expectation is also at most \frac{ 1 }{ 2 }

\mathbb{E}_{X} [\min [\eta (X), 1 - \eta (X)]] \leq \frac{ 1 }{ 2 }.
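
For a concrete example (using the same illustrative model as the sketch above: X uniform on [0, 1] and \eta (x) = x), the Bayes risk has a closed form

R (h^{*}) = \mathbb{E}_{X} \left[ \min \left[ X, 1 - X \right] \right] = \int_{0}^{1/2} x \, dx + \int_{1/2}^{1} (1 - x) \, dx = \frac{ 1 }{ 8 } + \frac{ 1 }{ 8 } = \frac{ 1 }{ 4 } \leq \frac{ 1 }{ 2 }.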

Excess risk

For any hypothesis h, we are interested in the difference between its risk R (h) and the Bayes risk R (h^{*}), which is called the excess risk of h

\mathcal{E} (h) = R (h) - R (h^{*}).

Theorem 31.2 For any hypothesis h, the excess risk satisfies

\mathcal{E} (h) = \mathbb{E}_{X} \left[ \lvert 2 \eta (X) - 1 \rvert \times \mathbb{1} \left[ h (X) \neq h^{*} (X) \right] \right].

Proof

By applying Lemma 31.1 to both R (h) and R (h^{*}) and using the linearity of expectation, we have

\begin{aligned} R (h) - R(h^{*}) & = \mathbb{E}_{X} \left[ \eta (X) (h^{*} (X) - h (X)) + (1 - \eta (X)) (h (X) - h^{*} (X)) \right] \\ & = \mathbb{E}_{X} \left[ (2 \eta (X) - 1) (h^{*} (X) - h (X)) \right]. \end{aligned}

Note that

h^{*} (X) - h (X) = \mathrm{sgn} [2 \eta (X) - 1] \times \mathbb{1} [h^{*} (X) \neq h (X)],

because it covers all three possible cases for the value of h^{*} (X) - h (X).

  1. If h^{*} (X) = h (X),

    h^{*} (X) - h (X) = 0.

  2. If h^{*} (X) = 1 and h (X) = 0, then \eta (X) > \frac{ 1 }{ 2 } because h^{*} (X) = 1, so

    h^{*} (X) - h (X) = 1 = \mathrm{sgn} [2 \eta (X) - 1].

  3. If h^{*} (X) = 0 and h (X) = 1, then \eta (X) < \frac{ 1 }{ 2 } because h^{*} (X) = 0, so

    h^{*} (X) - h (X) = -1 = \mathrm{sgn} [2 \eta (X) - 1].

Therefore,

\begin{aligned} R (h) - R (h^{*}) & = \mathbb{E}_{X} \left[ (2 \eta (X) - 1) \, \mathrm{sgn} [2 \eta (X) - 1] \times \mathbb{1} [h^{*} (X) \neq h (X)] \right] \\ & = \mathbb{E}_{X} \left[ \lvert 2 \eta (X) - 1 \rvert \times \mathbb{1} [h^{*} (X) \neq h (X)] \right], \end{aligned}

where the last equality holds since x \times \mathrm{sgn} [x] = \lvert x \rvert.
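
Continuing the illustrative example with X uniform on [0, 1], \eta (x) = x, and h (x) = \mathbb{1} [x > 0.7]: since h^{*} (x) = \mathbb{1} [x > \frac{ 1 }{ 2 }], the two hypotheses disagree exactly when x \in (\frac{ 1 }{ 2 }, 0.7], so Theorem 31.2 gives

\mathcal{E} (h) = \mathbb{E}_{X} \left[ \lvert 2 X - 1 \rvert \times \mathbb{1} \left[ \frac{ 1 }{ 2 } < X \leq 0.7 \right] \right] = \int_{1/2}^{0.7} (2 x - 1) \, dx = 0.04,

which matches the direct computation R (h) - R (h^{*}) = 0.29 - 0.25 = 0.04 from Lemma 31.1 and Theorem 31.1.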
